Although it can be challenging, some recent innovations have lowered the learning curve. RStudio is an example. While R is the underlying statistical software, RStudio is an Integrated Development Environment (IDE) for R. It provides features such as the capability to view help files, plots, and the workspace environment on the same screen in which you are scripting, minimizing the need to switch back and forth between screens. It also offers a Graphical User Interface (GUI) set of menu options to perform some basic tasks such as opening and saving scripts and opening and saving your workspace.
The image below displays roughly what you see when you open RStudio. The four panes might be arranged in a slightly different order. To get all four panes, you might have to select File > New File > R Script from the menu.
[Figure: the default RStudio layout, with its four panes]
Once you do that, the panel at the top left is your script editor. This is where you type the commands that you will send to R. Think of it like a painter's sketch pad: you try out different ideas and, when you like them, you put them onto the actual canvas.
In RStudio, there are multiple ways to execute commands from your script. You can source the entire script file sequentially with Cmd+Shift+S on macOS (Ctrl+Shift+S on Windows). Alternatively, you can run all the lines from the beginning of the script up to your cursor with Cmd+Option+B (Ctrl+Alt+B). Most often, though, you will send commands line by line (the cursor's position dictates which line is executed) or select a section of lines and run that code; both work by entering Cmd+Return (Ctrl+Enter).
The top right panel shows the workspace you are working in; the bottom right panel shows a combination of things, primarily any graphs you produce (plots) and help files you want to read.
Commands in R take the form command(x,y), with arguments like x and y passed to each command inside the parentheses. See what happens when you type the following in your script editor and then press Cmd+Return (or Ctrl+Enter on Windows):
print('Hello world')
## [1] "Hello world"
The results of any command can be saved with R's assignment operator <-. The object on the left receives, and becomes an alias for, what is on the right. It's awkward, it's true; one way to make it easier to remember is to read statements with the operator as left gets right. So in the following command, the object abc gets the result of print('Hello world').
abc<-print('Hello world')
abc
## [1] "Hello world"
install.packages('tidyverse')
install.packages('haven')
install.packages('pastecs')
install.packages('car')
install.packages('psych')
install.packages('stargazer')
install.packages('xtable')
install.packages('ggeffects')
There have been some remarkable and important innovations inside R recently that have improved the language, making it more powerful and more user-friendly. These are a set of packages designed by Hadley Wickham of the University of Auckland and RStudio. Collectively, this set of packages has come to be known as the tidyverse. Wickham uses the term 'tidy' to refer to data that is structured such that each row is an observation and each column is a variable. There is a particular logic associated with the tidyverse that makes it worthwhile to mark it off from what is known as base R.
Running library(tidyverse) will load the following packages:
ggplot2, for data visualisation.
dplyr, for data manipulation.
tidyr, for data tidying.
readr, for data import.
purrr, for functional programming.
tibble, for tibbles, a modern re-imagining of data frames.

library(tidyverse)
In addition, we'll need the haven package from the tidyverse to load data into R.
library(haven)
And then, we'll need a few more packages beyond the tidyverse to do our work.
There are multiple types of data within the R environment and we’ll start by working through some of the different ways that data are stored.
Numbers are stored as 'numeric' data.
num<-1
Character strings (e.g. letters, or numbers stored as characters rather than as numbers) have to be quoted. Single and double quotes are interchangeable in R; for consistency, we use single quotes when dealing with strings.
char<-'1'
You can compare what both look like by printing both to the screen.
num
## [1] 1
char
## [1] "1"
You can also use the class() command to check what class an object is.
class(num)
## [1] "numeric"
class(char)
## [1] "character"
Objects of the same class can be stored in a vector of objects. Vectors are formed with the c() command (c for combine, or concatenate), using commas to separate each element in the vector.
gender<-c(0,1,0)
age<-c(18,22,33)
lucky<-c(1,2,3)
Matrices are combinations of vectors, all of the same class (here, numeric vectors), arranged in rows and columns. They can be formed with the command matrix().
Combine age, gender and lucky into one matrix called matrix1 with 3 columns, using the matrix() command and its ncol= and/or nrow= arguments.
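One possible solution to this exercise, using the vectors defined above:

```r
#The three vectors defined earlier
age<-c(18,22,33)
gender<-c(0,1,0)
lucky<-c(1,2,3)

#matrix() fills column-by-column by default,
#so each vector ends up in its own column
matrix1<-matrix(c(age, gender, lucky), ncol=3)
matrix1
```

An alternative is cbind(age, gender, lucky), which binds the vectors as columns and keeps their names as column names.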
R has a powerful way of accessing elements inside vectors and matrices. With single-dimension matrices (i.e. vectors), a number inside square brackets [i] returns the ith element of that vector.
Get your own age from age. Get your own gender from gender.
In matrices, the first position in the square brackets returns the row and the second position returns the column. This is an important convention in R, and I'm not aware of any function or command where it is reversed: first rows, then columns.
1. Get the first row of matrix1.
2. Then get the second row.
3. Then return the first column of matrix1.
You can also get a range of consecutive columns with the : notation. And you can get a selection of columns by putting a vector of column numbers, built with c(), inside the square brackets.
1. Return the first through third columns of matrix1.
2. Return the first and third columns of matrix1.
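One way to work through these indexing exercises, assuming the vectors and matrix1 created above:

```r
#Recreate the objects from above
age<-c(18,22,33)
gender<-c(0,1,0)
lucky<-c(1,2,3)
matrix1<-matrix(c(age, gender, lucky), ncol=3)

age[1]              #first element of the age vector
matrix1[1,]         #first row
matrix1[2,]         #second row
matrix1[,1]         #first column
matrix1[,1:3]       #first through third columns
matrix1[,c(1,3)]    #first and third columns
```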
Data frames are special types of matrices. They are also stored in rows and columns, but they allow for mixing classes of variables (i.e. numeric and character vectors can be stored alongside each other). They also explicitly assign variable names to columns. These can be accessed with the $ immediately following the data frame’s name, rather than just the column number.
Put more glibly, matrices have rows and columns of numbers; data frames have observations (in the rows) and variables (in the columns).
Data frames are constructed with the command data.frame(variable_name=variable, variable_name=variable).
df<-data.frame(age=age, gender=gender, lucky=lucky)
df$age
## [1] 18 22 33
df$gender
## [1] 0 1 0
One of the most important things to do on creating a data frame or reading data into R is to inspect it to get a feel for it. The following commands are useful and should be used often!
#summarize the first few rows of a data frame
head(df)
## age gender lucky
## 1 18 0 1
## 2 22 1 2
## 3 33 0 3
#Examine the structure of a data frame
str(df)
## 'data.frame': 3 obs. of 3 variables:
## $ age : num 18 22 33
## $ gender: num 0 1 0
## $ lucky : num 1 2 3
#Summary statistics of a data frame.
summary(df)
## age gender lucky
## Min. :18.00 Min. :0.0000 Min. :1.0
## 1st Qu.:20.00 1st Qu.:0.0000 1st Qu.:1.5
## Median :22.00 Median :0.0000 Median :2.0
## Mean :24.33 Mean :0.3333 Mean :2.0
## 3rd Qu.:27.50 3rd Qu.:0.5000 3rd Qu.:2.5
## Max. :33.00 Max. :1.0000 Max. :3.0
#Access one variable of a data frame
df$age
## [1] 18 22 33
You can also check the class of each individual variable.
#Check class of each variable
class(df$age)
## [1] "numeric"
class(df$gender)
## [1] "numeric"
class(df$lucky)
## [1] "numeric"
Factors are particular types of character vectors. Factors are R's term for categorical variables. Storing a character vector as a factor instructs R to treat it like a categorical variable. The key attribute that factors have is levels, which are like the value labels in SPSS. There is a key difference between SPSS value labels and factor levels, however, to which we will return below.
In our example of genders and ages, we want to convert the 1s in the gender variable to female and the 0s to male. The recode() command from the car package is the ticket. car is one of the most versatile and frequently used packages in R. Get used to it, quickly!
library(car)
recode(df$gender, "0='male' ; 1='female'", as.factor=T)
## [1] male female male
## Levels: female male
df$gender2<-recode(df$gender,
"0='male' ; 1='female'",
as.factor=T)
levels(df$gender2)
## [1] "female" "male"
list1<-list(age, gender,lucky)
#print the numeric variable num as a character
as.character(num)
## [1] "1"
#print the character variable char as a number
as.numeric(char)
## [1] 1
#Convert both to a factor (categorical variable)
as.factor(num)
## [1] 1
## Levels: 1
as.factor(char)
## [1] 1
## Levels: 1
#Addition
x<-1+2
x
## [1] 3
y<-2+1
#Division
x/y
## [1] 1
#Multiplication
x*y
## [1] 9
#Exponentiation
x^2
## [1] 9
#Square Root
sqrt(9)
## [1] 3
#Note the capitalization pattern for function names:
#words are not separated with periods or underscores; instead, each new word is capitalized.
#The arguments that each function takes are specified in parentheses.
functionName<-function(x,y,z){ #The body of the function is opened with a curly brace
#The function commands are entered here
}# The function is closed here
test<-function(i) {
print(i)
}
test(7)
## [1] 7
test(as.character(7))
## [1] "7"
?as.factor
#Read the file in
ces<-read_sav(file=
'http://lispop.ca/sites/lispop.ca/files/ces_2015.sav')
#Different ways of summarizing the data set
#Check the first few rows
head(ces)
## # A tibble: 6 x 10
## age newspapers TV radio internet education gender income vote
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl+l>
## 1 56 2 7 7 7 4 2 145 1
## 2 22 NA NA NA NA 3 1 NA <NA>
## 3 49 NA NA NA NA 4 1 130 <NA>
## 4 38 1 5 1 7 2 1 60 1
## 5 70 6 5 3 0 3 2 57 1
## 6 47 NA NA NA NA 4 1 NA <NA>
## # ... with 1 more variable: vote2 <dbl+lbl>
#The last few rows
tail(ces)
## # A tibble: 6 x 10
## age newspapers TV radio internet education gender income vote
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl+l>
## 1 42 0 0 1 7 4 1 NA 1
## 2 46 3 7 1 7 4 1 NA 1
## 3 84 7 7 7 0 3 1 NA 1
## 4 39 5 7 7 2 2 1 72 1
## 5 42 NA NA NA NA 4 2 NA <NA>
## 6 50 NA NA NA NA 3 2 120 <NA>
## # ... with 1 more variable: vote2 <dbl+lbl>
#Check the structure of the object
str(ces)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4202 obs. of 10 variables:
## $ age : atomic 56 22 49 38 70 47 74 61 63 45 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ newspapers: atomic 2 NA NA 1 6 NA NA 3 4 4 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ TV : atomic 7 NA NA 5 5 NA NA 5 7 7 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ radio : atomic 7 NA NA 1 3 NA NA 5 3 4 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ internet : atomic 7 NA NA 7 0 NA NA 4 0 7 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ education :Class 'labelled' atomic [1:4202] 4 3 4 2 3 4 3 1 4 2 ...
## .. ..- attr(*, "format.spss")= chr "F8.0"
## .. ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
## .. .. ..- attr(*, "names")= chr [1:4] "less than high school" "completed high school" "college, some university" "university"
## $ gender :Class 'labelled' atomic [1:4202] 2 1 1 1 2 1 1 2 1 2 ...
## .. ..- attr(*, "format.spss")= chr "F8.0"
## .. ..- attr(*, "labels")= Named num [1:2] 1 2
## .. .. ..- attr(*, "names")= chr [1:2] "Female" "Male"
## $ income : atomic 145 NA 130 60 57 NA 40 32 NA 150 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ vote :Class 'labelled' atomic [1:4202] 1 NA NA 1 1 NA NA 2 1 1 ...
## .. ..- attr(*, "label")= chr "Did YOU vote in the election?"
## .. ..- attr(*, "format.spss")= chr "F8.0"
## .. ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
## .. .. ..- attr(*, "names")= chr [1:4] "Yes" "No" "Don't know" "Refused"
## $ vote2 :Class 'labelled' atomic [1:4202] 2 NA NA 1 1 NA NA NA NA 1 ...
## .. ..- attr(*, "format.spss")= chr "F8.0"
## .. ..- attr(*, "labels")= Named num [1:6] 1 2 3 4 5 6
## .. .. ..- attr(*, "names")= chr [1:6] "Liberal" "Conservatives" "NDP" "Bloc Quebecois" ...
#Summarize the data set
summary(ces)
## age newspapers TV radio
## Min. : 18.00 Min. :0.00 Min. :0.000 Min. :0.00
## 1st Qu.: 45.00 1st Qu.:0.00 1st Qu.:3.000 1st Qu.:1.00
## Median : 58.00 Median :2.00 Median :6.000 Median :5.00
## Mean : 56.43 Mean :3.12 Mean :4.835 Mean :4.17
## 3rd Qu.: 68.00 3rd Qu.:7.00 3rd Qu.:7.000 3rd Qu.:7.00
## Max. :115.00 Max. :7.00 Max. :7.000 Max. :7.00
## NA's :135 NA's :1217 NA's :1216 NA's :1223
## internet education gender income
## Min. :0.000 Min. :1.000 Min. :1.000 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 50.00
## Median :3.000 Median :3.000 Median :1.000 Median : 80.00
## Mean :3.357 Mean :2.982 Mean :1.459 Mean : 96.91
## 3rd Qu.:7.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:129.00
## Max. :7.000 Max. :4.000 Max. :2.000 Max. :900.00
## NA's :1227 NA's :36 NA's :1607
## vote vote2
## Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :2.000
## Mean :1.051 Mean :1.958
## 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :2.000 Max. :6.000
## NA's :1221 NA's :1710
#See the variables names
names(ces)
## [1] "age" "newspapers" "TV" "radio" "internet"
## [6] "education" "gender" "income" "vote" "vote2"
As referred to above, the haven package has a labelled class, an intermediate class meant to mirror the way that software like SPSS and Stata store variable metadata. It is not meant to be used for analysis in R.
There are two ways to deal with the labelled class. You can use the function as_factor() to turn a variable from a labelled variable into a standard categorical variable. Watch what happens below when we use the function table() to produce a frequency table of the variable gender.
table(ces$gender)
##
## 1 2
## 2272 1930
Now wrap ces$gender in as_factor().
table(as_factor(ces$gender))
##
## Female Male
## 2272 1930
The labelled class is not meant to be used throughout an R analysis. It is an intermediate class, meant to be converted to one of the other R classes quickly.
All we do is write over the original variables with the results of as_factor().
ces$gender<-as_factor(ces$gender)
ces$education<-as_factor(ces$education)
ces$vote<-as_factor(ces$vote)
ces$vote2<-as_factor(ces$vote2)
Now look what has happened to the factor variables.
str(ces)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4202 obs. of 10 variables:
## $ age : atomic 56 22 49 38 70 47 74 61 63 45 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ newspapers: atomic 2 NA NA 1 6 NA NA 3 4 4 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ TV : atomic 7 NA NA 5 5 NA NA 5 7 7 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ radio : atomic 7 NA NA 1 3 NA NA 5 3 4 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ internet : atomic 7 NA NA 7 0 NA NA 4 0 7 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ education : Factor w/ 4 levels "less than high school",..: 4 3 4 2 3 4 3 1 4 2 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 1 1 2 1 1 2 1 2 ...
## $ income : atomic 145 NA 130 60 57 NA 40 32 NA 150 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ vote : Factor w/ 4 levels "Yes","No","Don't know",..: 1 NA NA 1 1 NA NA 2 1 1 ...
## ..- attr(*, "label")= chr "Did YOU vote in the election?"
## $ vote2 : Factor w/ 6 levels "Liberal","Conservatives",..: 2 NA NA 1 1 NA NA NA NA 1 ...
A slightly quicker way, albeit with some pitfalls, is that as_factor() can actually be applied to an entire data frame. The trick is that you have to be sure that all your labelled variables really are factors; some might actually be numeric. Here, we know that all the labelled variables are in fact factors. Note: with larger data sets like the full CES, this will not always be the case!
ces<-as_factor(ces)
Note: these exercises really only apply to reading data in from SPSS or Stata format. If you have more raw data, as in a csv file, or perhaps your own entered data from your own experiments, these steps will not be necessary. As such, these steps may be particular to people working with large, cross-sectional data in fields like political science, sociology, economics and public health, rather than in experimental fields like psychology, medicine or other natural sciences.
Often we want to subset particular rows or select particular variables. Suppose we want to split the data frame into two data frames, one for men and one for women. (We can refresh our memories of which variables we are looking at by using lapply() and the attr() command.) The relevant command is filter().
men<-filter(ces, gender=='Male')
women<-filter(ces, gender=='Female')
The filter() command is one of the innovations from the tidyverse. Using this example, we can illustrate some important tidyverse innovations to which we will return later. The first is the pipe operator %>%, which is used to chain commands together: it takes the result of the expression on its left-hand side and feeds it to the function on its right-hand side. Watch:
#Take the data frame we want to filter
men<-ces %>% #use the pipe to send the results to the next command
filter(., gender=='Male') #The . is a placeholder; it just means use the data that has come from the pipe.
Now we can check what the results are.
#see the male data set
str(men)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1930 obs. of 10 variables:
## $ age : atomic 56 70 61 45 29 58 52 57 63 64 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ newspapers: atomic 2 6 3 4 7 1 1 0 0 7 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ TV : atomic 7 5 5 7 5 7 5 5 7 7 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ radio : atomic 7 3 5 4 5 3 6 3 7 7 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ internet : atomic 7 0 4 7 7 2 7 7 7 0 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ education : Factor w/ 4 levels "less than high school",..: 4 3 1 2 3 3 2 4 3 3 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ income : atomic 145 57 32 150 90 120 NA 90 70 40 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ vote : Factor w/ 4 levels "Yes","No","Don't know",..: 1 1 2 1 1 1 2 1 1 1 ...
## ..- attr(*, "label")= chr "Did YOU vote in the election?"
## $ vote2 : Factor w/ 6 levels "Liberal","Conservatives",..: 2 1 NA 1 1 1 NA 1 1 3 ...
Sometimes, though, we don't want to pick out certain observations; we only want certain variables. Here, the relevant command is select(). Let's select all the media consumption variables. We can actually do something interesting with them.
There are multiple ways to do this.
#Remind ourselves of the names of variables
names(ces)
#select by listing variable names
media<-ces %>%
select(., radio, internet, TV, newspapers)
#select by using the position number. We want variables 2:5
media<-ces %>%
select(., 2:5)
#Combine the colon and the names
media<-ces %>%
select(., radio:newspapers)
#check
str(media)
Inevitably on R help boards, someone poses the question, 'How do I construct a for() loop in R?' The answer is always a reference to the apply() family of commands. There are three core functions and two lesser-used ones. The three core functions are:
apply() applies a function to each row or column of a matrix or data frame.
lapply() applies a function to each element of a vector, data frame or matrix and returns a list, each element of which is the result of applying the function.
sapply() is a wrapper for lapply() that returns a vector or a matrix; it does not return a list.

Here, we can use the apply() command to calculate the average overall media consumption of each person.
#In apply(), the second argument 1 applies the function to each row; 2 applies it to each column
#when there are no arguments to supply to the function:
#apply(data_frame, 1, function_name)
#
#when there is an argument you need to supply to a function,
#you can use any letter to serve as a variable:
#apply(data_frame, 1, function(x) function_name(x))
apply(media, 1, function(x) mean(x))
#Now we just need to figure out how to save it.
Assignment 4: Reattach the results from apply(media, 1, function(x) mean(x)) to the data frame ces.
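One possible solution, assuming the media data frame selected above:

```r
#Store the row means in a new variable called 'average' in ces
ces$average<-apply(media, 1, function(x) mean(x))
```

Note that mean() returns NA for any respondent with a missing value on one of the media variables, which matches the NAs in the average column below.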
Now we can check what ces looks like:
str(ces)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4202 obs. of 11 variables:
## $ age : atomic 56 22 49 38 70 47 74 61 63 45 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ newspapers: atomic 2 NA NA 1 6 NA NA 3 4 4 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ TV : atomic 7 NA NA 5 5 NA NA 5 7 7 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ radio : atomic 7 NA NA 1 3 NA NA 5 3 4 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ internet : atomic 7 NA NA 7 0 NA NA 4 0 7 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ education : Factor w/ 4 levels "less than high school",..: 4 3 4 2 3 4 3 1 4 2 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 1 1 2 1 1 2 1 2 ...
## $ income : atomic 145 NA 130 60 57 NA 40 32 NA 150 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ vote : Factor w/ 4 levels "Yes","No","Don't know",..: 1 NA NA 1 1 NA NA 2 1 1 ...
## ..- attr(*, "label")= chr "Did YOU vote in the election?"
## $ vote2 : Factor w/ 6 levels "Liberal","Conservatives",..: 2 NA NA 1 1 NA NA NA NA 1 ...
## $ average : num 5.33 NA NA 2.33 4.67 ...
Plotting

This is where the fun begins.
Base R’s graphics are built around the plot() command. The tidyverse’s ggplot2 plotting function is much more systematic and built on fairly consistent principles.
The ggplot() command is the basic command: it contains the basic parameters of a graph, and then specifics are added to it. Below, we specify the underlying data (ces) and the most important aesthetics of a graph, the x and the y variable. Note: in ggplot2, 'aesthetics' has a very particular meaning. Namely, it maps particular visual features of a plot (i.e. color, shape of points, type of line, the fill of bars) to the values of variables (see the figure below).
ces %>%
ggplot(., aes(x=income))
No plot!
But nothing appears. This is because the second core aspect of the plot is the geom: the graphical form used to represent the relationship between x and y (see the next figure).
#Add histogram
ces %>%
ggplot(., aes(x=income))+
geom_histogram()
Histogram
Making bivariate plots is just a matter of adding in a y-variable for the y-axis.
ces %>%
ggplot(., aes(x=income, y=average))+
geom_point()
Scatterplot
Very quickly, we can also add in grouping variables, like education and/or gender. We can do this in multiple ways.
First, we could color the points by level of education. The key is to map the color aesthetic of the points to the variable education.
ces %>%
ggplot(., aes(x=income, y=average))+
geom_point(aes(col=education))
Points are colored by level of education.
Note that we have some missing values (NA) in level of education. This is somewhat annoying. We can remove them quite quickly by adding a filter() command to the chain of plotting commands. The is.na() function takes a single variable; for each case, it returns TRUE if the value is missing and FALSE if it is not. So, we want to filter the data set to include only the cases that are not missing data for education.
ces %>%
filter(is.na(education)==F) %>%
ggplot(., aes(x=income, y=average, group=education))+
geom_point(aes(col=education))
Missing values are removed
Notice, though, when you run the following command, nothing is printed.
plot1<-ces %>%
filter(is.na(education)==F) %>%
ggplot(., aes(x=income, y=average, group=education))+
geom_point(aes(col=education))
But when you call plot1, it is printed.
plot1
The amazing thing about R in general, and a major driver of its success, is that every single part of the plot is modifiable. Try doing that in Excel or SPSS. For example, we can change the axis labels and main title with the command labs().
plot1+labs(x='Average Household Income', y='Average Days Of News Consumption', title='Average Media Consumption By Income')
We can also add in another grouping variable, say gender. But rather than changing the colors or the shapes, we could make two different sub-graphs: one for men, one for women. These are known as facets. You can use the command facet_wrap() or facet_grid(). Inside both, you use what is known as ~ formula notation in R. We will come back to this when we deal with linear models. For now, just know that in facet_grid() the row variable goes on the left and the column variable goes on the right of the tilde (rows ~ columns). If you only have one variable to facet by, just put it on the right side.
ces %>%
filter(is.na(education)==F) %>%
ggplot(., aes(x=income, y=average, group=education))+
geom_point(aes(col=education))+
facet_wrap(~gender)
Facet by gender
You can also add a line of best fit directly to the plots using geom_smooth(). geom_smooth() can fit linear lines of best fit from a simple regression of y on x, or it can fit locally-weighted (loess) lines of best fit.
ces %>%
filter(is.na(education)==F) %>%
ggplot(., aes(x=income, y=average))+
geom_point(aes(col=education))+
facet_wrap(~gender)+
geom_smooth(method='lm', aes(col=education))
Smooth
Notice, though that it only produces smooths for the whole male and female data sets. It doesn’t produce separate smooths broken down by education. Looking above, try to figure out how to do that.
Produce a plot that shows income on the x-axis, average media consumption on the y-axis, grouped by color for education, facetted by gender and with linear smooths (lines of best fit) for each level of education.
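One possible solution: move the color mapping into the top-level aes(), so that the smooths, like the points, are computed separately for each level of education.

```r
ces %>%
  filter(is.na(education)==F) %>%
  ggplot(., aes(x=income, y=average, col=education))+
  geom_point()+
  facet_wrap(~gender)+
  geom_smooth(method='lm')
```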
The other thing that is apparent is that there are a few data points at the high end of the income scale that are distorting the relationship a little bit. Recreate the plot above, but filter the data set so that it only includes people earning less than $150,000.
Produce the same plot, but filter out those who earn more than $150,000.
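One possible solution. Income appears to be recorded in thousands of dollars (values like 145 and 130 above), so earning less than $150,000 means income < 150; we simply add that condition to the filter.

```r
ces %>%
  filter(is.na(education)==F, income < 150) %>%
  ggplot(., aes(x=income, y=average, col=education))+
  geom_point()+
  facet_wrap(~gender)+
  geom_smooth(method='lm')
```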
Plotting categorical data is usually done with bar charts of various types. Because the y-axis is usually a count (sometimes a percentage), a y-axis is not usually specified.
ces %>%
ggplot(., aes(x=vote2))+
geom_bar()
Vote Choice in 2015
Notice we probably want to filter out the NAs. Those are probably people who did not vote.
ces %>%
filter(., is.na(vote2)==F) %>%
ggplot(., aes(x=vote2))+
geom_bar()
To add in a grouping variable, we can facet:
ces %>%
filter(., is.na(vote2)==F) %>%
ggplot(., aes(x=vote2))+
geom_bar()+
facet_wrap(~gender)
Vote Choice by Gender
Or we can add a fill with color. The argument position='dodge' puts the bars side by side.
ces %>%
filter(is.na(vote2)==F) %>%
ggplot(., aes(x=vote2))+
geom_bar(aes(fill=gender), position='dodge')
Vote Choice With Color Fill By Gender
ggplot() really starts to sing, however, when we combine it with some of the newest innovations in R, namely the dplyr and tidyr packages. To introduce these, it is crucial to understand the difference between long and wide data.
In long (tidy) data, each column is one variable and each row is one observation. Currently our data is a mix of long and wide data.
head(ces)
## # A tibble: 6 x 11
## age newspapers TV radio internet education gender income vote
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <fct>
## 1 56 2 7 7 7 university Male 145 Yes
## 2 22 NA NA NA NA college, some… Female NA <NA>
## 3 49 NA NA NA NA university Female 130 <NA>
## 4 38 1 5 1 7 completed hig… Female 60 Yes
## 5 70 6 5 3 0 college, some… Male 57 Yes
## 6 47 NA NA NA NA university Female NA <NA>
## # ... with 2 more variables: vote2 <fct>, average <dbl>
In particular, the TV, newspapers, radio and internet variables are actually all different observations on the same underlying variable, which is media consumption. In tidyr, the gather() function takes columns that you would like to collapse into long format. The syntax is gather(data_frame, name_of_new_variable, name_of_value_variable, variables_to_be_gathered).
#Refresh our memory of what we are working with
names(ces)
## [1] "age" "newspapers" "TV" "radio" "internet"
## [6] "education" "gender" "income" "vote" "vote2"
## [11] "average"
#Gather the Media consumption Variables and values into a Media variable and a Days Variable
ces.out<-ces %>%
gather(., Media, Days, c(2:5,11))
Now, produce a scatterplot with income on the x-axis and media consumption on the y-axis, facetted by the different types of media. For fun, see if you can add in a smooth for each facet.
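One possible solution, using the long-format ces.out data frame created above (with its Media and Days variables):

```r
ces.out %>%
  filter(is.na(Days)==F) %>%
  ggplot(., aes(x=income, y=Days))+
  geom_point()+
  facet_wrap(~Media)+
  geom_smooth(method='lm')
```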
Formal Statistical Tests in R

Beyond data visualization, R is fully equipped to conduct basically all formal statistical tests.
Are newspaper consumption and age correlated?
#Pearson's correlation
cor(ces$newspapers, ces$age, use='complete.obs', method=c('pearson'))
## [1] 0.2847393
The function cor.test() can also produce significance tests for correlations.
cor.test(ces$newspapers, ces$age, use='complete.obs', method=c('pearson'))
##
## Pearson's product-moment correlation
##
## data: ces$newspapers and ces$age
## t = 16.065, df = 2925, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2510996 0.3176920
## sample estimates:
## cor
## 0.2847393
Is there a difference between gender and proclivity to vote? Key to testing hypotheses about the relationships between two categorical variables is producing a table of counts with the table() function.
#produce table
tab_vote_gender<-table(ces$vote, ces$gender)
#print
tab_vote_gender
##
## Female Male
## Yes 1481 1348
## No 69 83
## Don't know 0 0
## Refused 0 0
Note that we have a small problem with the variable vote. Let’s just get rid of those levels that have no cases.
#print levels of ces$vote
levels(ces$vote)
## [1] "Yes" "No" "Don't know" "Refused"
Now, drop the levels and save the results back in ces$vote. We are just overwriting it.
ces$vote<-droplevels(ces$vote)
Now, rewrite the table.
tab_vote_gender<-table(ces$vote, ces$gender)
tab_vote_gender
##
## Female Male
## Yes 1481 1348
## No 69 83
The basic function is chisq.test()
chisq.test(tab_vote_gender)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab_vote_gender
## X-squared = 2.5245, df = 1, p-value = 0.1121
This is the classic experimental type of data result. There are many ways to come at this. First, let's see if we can calculate the group means on our own: the group means of newspaper consumption by gender. This is such a common thing to do in data analysis that it is one of the tidyverse's innovations; we use the command group_by().
Notice how it lists gender as a group.
#Start with the data
ces %>%
#Now we want to make groups to calculate group means.
group_by(gender)
## # A tibble: 4,202 x 11
## # Groups: gender [2]
## age newspapers TV radio internet education gender income vote
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <fct>
## 1 56 2 7 7 7 university Male 145 Yes
## 2 22 NA NA NA NA college, som… Female NA <NA>
## 3 49 NA NA NA NA university Female 130 <NA>
## 4 38 1 5 1 7 completed hi… Female 60 Yes
## 5 70 6 5 3 0 college, som… Male 57 Yes
## 6 47 NA NA NA NA university Female NA <NA>
## 7 74 NA NA NA NA college, som… Female 40 <NA>
## 8 61 3 5 5 4 less than hi… Male 32 No
## 9 63 4 7 3 0 university Female NA Yes
## 10 45 4 7 4 7 completed hi… Male 150 Yes
## # ... with 4,192 more rows, and 2 more variables: vote2 <fct>,
## # average <dbl>
What we want to do now is to calculate the group average for each group that we have defined.
The key function here is summarize(). It kind of makes sense: we are summarizing each group with a mean. summarize() creates a new data frame. One variable will be the groups made by our grouping command and the second variable will store the results of the calculation we tell it to conduct.
ces %>% #Use the magrittr symbol to connect ces to the following line
#Now we want to make groups to calculate group means.
group_by(gender) %>%
summarize(avg=mean(newspapers, na.rm=TRUE))
## # A tibble: 2 x 2
## gender avg
## <fct> <dbl>
## 1 Female 3.05
## 2 Male 3.20
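Under the hood, group_by() plus summarize() is doing the same work as base R's tapply(): split a variable by a factor and apply a function to each piece. A minimal sketch with hypothetical data (these vectors are made up, not from the ces survey):

```r
# Hypothetical data: days of newspaper reading and a grouping factor
papers <- c(2, 7, NA, 5, 6, 3)
grp    <- factor(c("Female", "Male", "Female", "Female", "Male", "Male"))
# Mean per group, dropping missing values -- the base-R analogue of
# group_by(gender) %>% summarize(avg = mean(newspapers, na.rm = TRUE))
means <- tapply(papers, grp, mean, na.rm = TRUE)
means
```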
Now we can go further and send those results straight on to ggplot() to plot the group means!
ces %>% #Use the magrittr symbol to connect ces to the following line
#Now we want to make groups to calculate group means.
group_by(gender) %>%
summarize(avg=mean(newspapers, na.rm=TRUE)) %>%
ggplot(., aes(x=gender, y=avg))+geom_point()+labs(title='Average Newspaper Consumption by gender')
We can even add confidence intervals for the group means. The summarize() command can use most mathematical and statistical functions to summarize groups. We use the n() function to count the number of cases per group, the sd() function to return the standard deviation, and the formula sd/sqrt(n) to calculate the standard error. Note that summarize() can use newly created columns (e.g. n) immediately in subsequent calculations.
out<-ces %>%
group_by(gender) %>%
summarize(avg=mean(newspapers, na.rm=TRUE), n=n(), sd=sd(newspapers, na.rm=T), se=sd/sqrt(n))
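The se column is just the usual standard error of the mean, sd/sqrt(n). A quick base-R check on a hypothetical toy vector:

```r
x  <- c(2, 4, 4, 4, 5, 5, 7, 9)   # toy data
n  <- length(x)
s  <- sd(x)                        # sample standard deviation
se <- s / sqrt(n)                  # standard error of the mean
c(n = n, sd = s, se = se)
```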
Now we can draw the plot, adding the error bars with the geom geom_errorbar().
out %>%
ggplot(., aes(x=gender, y=avg))+geom_point()+labs(title='Average Newspaper Consumption by gender')+geom_errorbar( aes(ymin=avg-(1.96*se), ymax=avg+(1.96*se)) )
We can also conduct the formal t-test with t.test() as follows:
t.test(newspapers~gender, data=ces)
##
## Welch Two Sample t-test
##
## data: newspapers by gender
## t = -1.3814, df = 2965.3, p-value = 0.1673
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.35891612 0.06221727
## sample estimates:
## mean in group Female mean in group Male
## 3.049001 3.197350
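The same formula interface works on any data frame; with toy numbers the arithmetic is easy to check by hand (the data below are hypothetical):

```r
# Two tiny groups; t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2)
toy <- data.frame(days  = c(1, 2, 3, 4, 5, 6),
                  group = factor(rep(c("A", "B"), each = 3)))
tt <- t.test(days ~ group, data = toy)  # Welch two-sample t-test, as above
round(tt$statistic, 2)  # (2 - 5) / sqrt(2/3) = -3.67
```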
Linear regression is the workhorse analysis in the social sciences. It is performed with the function lm().
We can start by modelling the relationship between age and media consumption.
#we can give it any name we want
model1<-lm(average~age, data=ces)
#summary produces useful results
summary(model1)
##
## Call:
## lm(formula = average ~ age, data = ces)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0010 -1.3184 0.0245 1.3417 4.3448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.847565 0.121352 15.22 <2e-16 ***
## age 0.038457 0.002064 18.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.746 on 2916 degrees of freedom
## (1284 observations deleted due to missingness)
## Multiple R-squared: 0.1064, Adjusted R-squared: 0.1061
## F-statistic: 347.2 on 1 and 2916 DF, p-value: < 2.2e-16
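Rather than reading numbers off the printed summary, you can pull them out programmatically with coef() (and confint() for confidence intervals). A sketch on noise-free toy data, where the true line y = 1 + 2x is recovered exactly:

```r
x <- 1:10
y <- 1 + 2 * x          # perfectly linear toy data
fit <- lm(y ~ x)
coef(fit)                # named vector: (Intercept) then x
# confint(fit) would return 95% confidence intervals for each coefficient
```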
You can add further covariates.
model2<-lm(average~age+income, data=ces)
summary(model2)
##
## Call:
## lm(formula = average ~ age + income, data = ces)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8999 -1.2760 0.0504 1.2774 4.0603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4165025 0.1740959 8.136 7.22e-16 ***
## age 0.0413106 0.0026775 15.429 < 2e-16 ***
## income 0.0031982 0.0005611 5.699 1.39e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.716 on 1931 degrees of freedom
## (2268 observations deleted due to missingness)
## Multiple R-squared: 0.1115, Adjusted R-squared: 0.1106
## F-statistic: 121.2 on 2 and 1931 DF, p-value: < 2.2e-16
Including categorical variables:
model3<-lm(average~age+income+education, data=ces)
summary(model3)
##
## Call:
## lm(formula = average ~ age + income + education, data = ces)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0122 -1.2401 0.0789 1.2602 4.1934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9998864 0.2303266 4.341 1.49e-05
## age 0.0425152 0.0027096 15.691 < 2e-16
## income 0.0028013 0.0005848 4.790 1.79e-06
## educationcompleted high school 0.2636095 0.1736239 1.518 0.12911
## educationcollege, some university 0.4419988 0.1602674 2.758 0.00587
## educationuniversity 0.4605357 0.1608421 2.863 0.00424
##
## (Intercept) ***
## age ***
## income ***
## educationcompleted high school
## educationcollege, some university **
## educationuniversity **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.714 on 1926 degrees of freedom
## (2270 observations deleted due to missingness)
## Multiple R-squared: 0.1157, Adjusted R-squared: 0.1134
## F-statistic: 50.42 on 5 and 1926 DF, p-value: < 2.2e-16
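Behind the scenes, lm() expands a factor into dummy (0/1 indicator) columns, one per non-reference level; that is why the output above shows a separate row for each education level except the first. model.matrix() lets you inspect this coding directly on a hypothetical toy factor:

```r
ed <- factor(c("low", "mid", "high", "mid"),
             levels = c("low", "mid", "high"))  # hypothetical 3-level factor
mm <- model.matrix(~ ed)
mm
# The first level ("low") is absorbed into the intercept as the reference
# category; "edmid" and "edhigh" are 0/1 indicators for the other levels.
```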
We can also test for interactions.
model4<-lm(average~age+income+education+age:income, data=ces)
summary(model4)
##
## Call:
## lm(formula = average ~ age + income + education + age:income,
## data = ces)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9637 -1.2374 0.0816 1.2588 4.1613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.169e+00 3.056e-01 3.825 0.000135
## age 3.964e-02 4.356e-03 9.101 < 2e-16
## income 1.052e-03 2.158e-03 0.488 0.625855
## educationcompleted high school 2.537e-01 1.740e-01 1.458 0.145077
## educationcollege, some university 4.301e-01 1.609e-01 2.673 0.007577
## educationuniversity 4.468e-01 1.617e-01 2.763 0.005780
## age:income 3.341e-05 3.969e-05 0.842 0.399980
##
## (Intercept) ***
## age ***
## income
## educationcompleted high school
## educationcollege, some university **
## educationuniversity **
## age:income
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.714 on 1925 degrees of freedom
## (2270 observations deleted due to missingness)
## Multiple R-squared: 0.1161, Adjusted R-squared: 0.1133
## F-statistic: 42.13 on 6 and 1925 DF, p-value: < 2.2e-16
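The colon in age:income adds only the product term; the shorthand age*income expands to age + income + age:income. model.matrix() shows the two formulas produce the same design columns (toy data):

```r
d <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))  # toy predictors
cols1 <- colnames(model.matrix(~ a + b + a:b, d))
cols2 <- colnames(model.matrix(~ a * b, d))      # same four columns
cols1
cols2
```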
It is important to be able to visualize the effects of coefficients, particularly for interaction terms. In R this is straightforward, especially for models without interaction terms. The workhorse function is predict(). The key is to provide a data frame of new sample data that is used to generate predicted values from a linear model.
So first we construct a data frame that contains the mean income, the mean age, and the four different levels of education. Note that the variables in the new data frame must be named exactly as in the underlying model.
#create the new data frame
newdat<-data.frame(income=mean(ces$income, na.rm=T), age=mean(ces$age, na.rm=T), education=levels(ces$education))
#show
newdat
## income age education
## 1 96.91137 56.4266 less than high school
## 2 96.91137 56.4266 completed high school
## 3 96.91137 56.4266 college, some university
## 4 96.91137 56.4266 university
Now feed this to our model (model3) with the predict() function.
#generate predicted values
preds<-predict(model3, newdat, se.fit=T, interval='confidence')
Now we want to combine these values with the levels of education.
preds<-data.frame(preds, education=levels(ces$education))
#show
preds
## fit.fit fit.lwr fit.upr se.fit df residual.scale
## 1 3.670347 3.384986 3.955708 0.14550349 1926 1.71365
## 2 3.933957 3.740567 4.127347 0.09860811 1926 1.71365
## 3 4.112346 3.980946 4.243746 0.06699969 1926 1.71365
## 4 4.130883 4.009563 4.252203 0.06186001 1926 1.71365
## education
## 1 less than high school
## 2 completed high school
## 3 college, some university
## 4 university
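As a sanity check on the mechanics, predict() simply evaluates the fitted equation at each row of the supplied data frame. A sketch with hypothetical, noise-free data:

```r
toy <- data.frame(x = 1:10, y = 1 + 2 * (1:10))  # exact line y = 1 + 2x
fit <- lm(y ~ x, data = toy)
nd  <- data.frame(x = c(0, 100))                  # values to predict at
p   <- predict(fit, nd)                           # 1 + 2*0 = 1 and 1 + 2*100 = 201
p
```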
Then we can plot.
preds %>%
ggplot(., aes(x=education, y=fit.fit))+geom_point()+
labs(x='Level of education', y='Predicted Days Of News Consumption')+
geom_errorbar(aes(ymin=fit.lwr, ymax=fit.upr))
One annoying thing is that when we made the factor education, R automatically ordered the levels alphabetically. But there is a natural order to them. We can use the command factor() with the levels argument to change the order of the levels. Note, however, that the levels specified must be identical to the levels of the underlying variable. Otherwise any unmatched levels will turn into NA!
levels(preds$education)
## [1] "college, some university" "completed high school"
## [3] "less than high school" "university"
preds$education<-factor(preds$education, levels=c('less than high school', 'completed high school', 'college, some university', 'university'))
preds %>%
ggplot(., aes(x=education, y=fit.fit))+geom_point()+
labs(x='Level of education', y='Predicted Days Of News Consumption')+
geom_errorbar(aes(ymin=fit.lwr, ymax=fit.upr))
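The warning about unmatched levels is worth demonstrating on a toy factor: a level name passed to factor() that does not match the data exactly silently becomes NA.

```r
f <- factor(c("low", "high", "low"))  # hypothetical factor
# Correctly spelled levels: all values survive, only the order changes
levels(factor(f, levels = c("low", "high")))
# A typo in a level name ("hgih") silently turns every "high" into NA
n_missing <- sum(is.na(factor(f, levels = c("low", "hgih"))))
n_missing
```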
There is a quicker way to do this, which is to use the ggeffects library.
#install
install.packages('ggeffects')
Load the library
library(ggeffects)
The command ggpredict() requires you to specify the term whose effects you are interested in varying. In its bare form it looks like this.
preds<-ggpredict(model3, terms='age')
#show
preds
## # A tibble: 10 x 5
## x predicted conf.low conf.high group
## <int> <dbl> <dbl> <dbl> <fct>
## 1 10 1.70 1.30 2.09 1
## 2 20 2.12 1.76 2.48 1
## 3 30 2.55 2.22 2.88 1
## 4 40 2.97 2.67 3.28 1
## 5 50 3.40 3.11 3.69 1
## 6 60 3.83 3.54 4.11 1
## 7 70 4.25 3.96 4.54 1
## 8 80 4.68 4.38 4.98 1
## 9 90 5.10 4.78 5.42 1
## 10 100 5.53 5.18 5.87 1
The really nice thing about this package is that the results are immediately formatted for plotting in ggplot2. Note that geom_ribbon() creates a band that can represent a confidence interval and is more suitable for line plots than geom_errorbar().
preds %>%
ggplot(., aes(x=x,y=predicted))+geom_point()+geom_line()+
geom_ribbon(aes(ymin=conf.low, ymax=conf.high), alpha=0.25)
We can also supply up to two additional grouping variables that will produce predicted values. We can specify exactly which values are plotted by using the [] notation. Note that there has to be a space between the variable name and the square bracket.
#Generate Predicted Values
preds<-ggpredict(model3, terms=c('age', 'income [25,50,75,100]', 'education'))
#show
preds
## # A tibble: 160 x 6
## x predicted conf.low conf.high group facet
## <int> <dbl> <dbl> <dbl> <fct> <fct>
## 1 10 1.50 1.09 1.90 25 less than high school
## 2 10 1.76 1.42 2.10 25 completed high school
## 3 10 1.94 1.65 2.23 25 college, some university
## 4 10 1.96 1.66 2.26 25 university
## 5 10 1.57 1.17 1.96 50 less than high school
## 6 10 1.83 1.50 2.16 50 completed high school
## 7 10 2.01 1.73 2.28 50 college, some university
## 8 10 2.03 1.74 2.31 50 university
## 9 10 1.64 1.24 2.03 75 less than high school
## 10 10 1.90 1.58 2.22 75 completed high school
## # ... with 150 more rows
And then we can plot:
preds %>%
ggplot(., aes(x=x, y=predicted))+
geom_point(aes(col=group))+
facet_wrap(~facet)
#Store the matrix so we can index it below
matrix1<-matrix(c(age, gender, lucky), nrow=3, ncol=3)
matrix1
## [,1] [,2] [,3]
## [1,] 18 0 1
## [2,] 22 1 2
## [3,] 33 0 3
#The number in the square brackets indicates which element of the vector to return
age[1]
gender[1]
matrix1[1,]
matrix1[2,]
matrix1[,1]
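With a small numeric matrix the indexing rules are easy to verify: [i, ] selects a row, [, j] selects a column, and a pair [i, j] selects a single element.

```r
m <- matrix(1:6, nrow = 2)  # filled column-wise: rows (1 3 5) and (2 4 6)
m[1, ]    # first row:     1 3 5
m[, 2]    # second column: 3 4
m[2, 3]   # single element: 6
```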
ces$average<-apply(media, 1, mean)
ggplot(filter(ces, !is.na(education)), aes(x=income, y=average))+
geom_point(aes(col=education))+
facet_wrap(~gender)+
geom_smooth(method='lm', aes(col=education))
ces %>%
filter(!is.na(education)) %>%
filter(income< 150) %>%
ggplot(., aes(x=income, y=average))+
geom_point(aes(col=education))+
facet_wrap(~gender)+
geom_smooth(method='lm', aes(col=education))
We can also fit a non-linear loess smooth instead of a straight line:
ggplot(ces.out, aes(x=income, y=Days))+
geom_point()+
facet_wrap(~Media)+
geom_smooth(method='loess')